Abstract - Numerous missing value computation approaches
for yeast data have been suggested in the literature. Over
the past few years, investigators have devoted considerable
research effort to methodical assessments of the different
computation procedures. The problem of handling missing
values is studied here with samples of tough
microorganisms, such as yeast. Expensive strategies exist
that aim to process a varied collection of samples. They
are regularly effective for disrupting several small
samples concurrently, but are much less effective for
larger samples. Manufacturers of such devices highlight
disruption rates for these small samples, with 5% of cells
disrupted in the range of 2 to 38 seconds, and frequently
neglect to state the organism disrupted or the sample
size. Initially, most procedures were evaluated with an
emphasis on the accuracy of the computation, using metrics
such as Correlation (uncentered), Correlation (centered),
Absolute correlation (uncentered), Absolute correlation
(centered), Spearman Rank correlation, Kendall’s tau,
Euclidean distance and City block distance. These metrics
identify the best clustering range. In the proposed
approach, running time is also computed for the various
methods using the same metrics. It has become clear that
the accuracy and running time for the whole yeast gene
dataset are assessed better by the hierarchical clustering
approach. Accuracy and running time are determined for
both large and small samples after the missing values have
been computed. Running times of the different clustering
methods on a yeast dataset are reported for a missing
value rate of 4%. Hierarchical clustering was the fastest
among the clustering methods considered (K-Means (gene)
clustering technique, Self-Organized Mapping and Principal
Component Analysis). However, SOM was still about 10 times
faster than k-means. The running time of the original
hierarchical method was about one third of that of its
proposed version.

Keywords - Cluster, Yeast data, Hierarchical clustering,
k-means clustering, filtering data.

I. INTRODUCTION 

The strongest evidence is that a small sample size does
not affect the quantification procedure. The whole yeast
gene dataset is processed in the same way for both small
and large sample sizes. Missing values in the yeast gene
data are visualized by reducing the dimensionality with
the hierarchical clustering approach. The objective of
this research is to predict the missing values.
Determining missing values in microarray data is an
essential step because a complete dataset is required in
several expression profile analyses in bioinformatics. One
way to validate the analysis of microarray data with
missing values is to repeat the computation, but this is
very costly and time consuming. Alternatively, one can
consider the capability of clustering methods such as
single linkage, complete linkage, average linkage and
centroid linkage. These hierarchical clustering methods
allow the dataset to preserve the important yeast gene
data, or its discriminative/predictive power for
classification/clustering purposes. The K-Means (gene)
clustering technique, Self-Organized Mapping and Principal
Component Analysis algorithms were clearly the slowest
computation methods. In unusual cases, the hierarchical
clustering method produced estimates for missing values
that were up to 4 times larger than the original values.
This suggests an inconsistency in the method's
implementation or procedure.

Integrative missing value assessment through hierarchical
clustering is the first technique to include data from
microarray datasets to improve missing data computation
[1]. However, it is hard to find such data in the datasets,
and even harder to find a set of genes that consistently
show expression similarity to the target gene. Meanwhile,
centroid linkage, single linkage, complete linkage and
average linkage are the foremost algorithms that exploit
the useful similarities embedded in the yeast microarray
data, along with the expression similarities, to enable
neighbor gene selection [2]. This approach outperformed
k-means at high missing percentages, owing to its control
of the amount and accuracy of the gene functions
interpreted in the yeast data. The Self-Organized Mapping
and Principal Component Analysis algorithms failed to
improve the time consumption of the computation process.

To our knowledge, this is the first study to examine the
effect of missing values and their computation on the
preservation of clustering results. Other studies of
missing values with the K-Means (gene) clustering
technique, Self-Organized Mapping and Principal Component
Analysis did not consider genetic analysis of the
clustering results. Their main finding was that even a
small amount of missing values may sharply reduce the
stability of K-Means (gene) clustering, Self-Organized
Mapping and Principal Component Analysis, and that
hierarchical clustering algorithms clearly recover this
stability [3]. The present outcomes agree with these
conclusions.

The three steps to retrieve data are Loading, Filtering
and Adjusting Data in clustering. Information in the form
of a dataset is loaded and processed in Cluster. The four
clustering methods (centroid linkage, single linkage,
complete linkage and average linkage) operate on the
loaded data once it has been filtered and adjusted through
the Filter Data and Adjust Data operations. Filtering
removes yeast gene expression data that do not satisfy
certain desired conditions. Adjusting applies conditional
operations to the data. The first choice to be made is how
similarity between yeast gene expression data is to be
defined. There are several ways to compute how similar two
series of numbers are. Cluster provides eight options,
namely Correlation (uncentered), Correlation (centered),
Absolute correlation (uncentered), Absolute correlation
(centered), Spearman Rank correlation, Kendall’s tau,
Euclidean distance and City block distance.
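
As a rough illustration of how these eight measures differ, the following Python sketch computes each of them for two series of numbers. The function names are illustrative, and the formulas are the usual textbook definitions. Cluster's own implementation may scale or transform them differently (for example when a correlation is turned into a distance), so this is a sketch under those assumptions and not the tool's exact behaviour. Missing values and all-zero series are not handled.

```python
import math

def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

def centered_correlation(x, y):
    """Pearson correlation: subtract each series' mean, then correlate."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_correlation([a - mx for a in x], [b - my for b in y])

def absolute_uncentered(x, y):
    return abs(uncentered_correlation(x, y))

def absolute_centered(x, y):
    return abs(centered_correlation(x, y))

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return centered_correlation(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs contribute zero."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))
```

The correlation-based measures are similarities (larger means more alike), while Euclidean and city block are distances (smaller means more alike), which is worth keeping in mind when the ranges of the different metrics are compared.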

II. RELATED WORKS
Several computation techniques have been proposed since
1963, such as hierarchical grouping and hierarchical
clustering, and since 2009, such as the K-Means (gene)
clustering technique, Self-Organized Mapping and Principal
Component Analysis [4, 5, 6, 7, 8, 9]. The most commonly
used of these techniques is hierarchical clustering.
However, all of the hierarchical methods, such as centroid
linkage, single linkage, complete linkage and average
linkage, are based only on the yeast gene expression
datasets themselves and use no external microarray
datasets or associated genetic data. Several simple
methods exist to handle missing values, e.g. eliminating
the genes with missing values from further study,
substituting missing values with zeros, or filling the
missing values with the row or column means/medians
[10, 11, 12]. These methods are not ideal because they do
not consider the relationships within the data. This
motivated the development of more refined missing value
methods that try to exploit the data associations using
the data present in the entire dataset [13].

As the data in Table I show, missing values are a common
difficulty that has to be addressed even in more modern
studies [14, 15]. Likewise, several genes have high
missing percentages. For genes with many missing values,
few values remain from which to determine how the gene is
associated with other genes in the dataset, which leads to
less accurate assessments. It is well known that gene
expression in cells is jointly regulated by similarity
factors and by information encoded in the nuclear and
mitochondrial genomes of the yeast [16]. The major
repeating unit of mitochondrial genomes has been
catalogued for thousands of microorganisms in the Genome
Database [17]. For instance, as mentioned in [18, 19, 20],
mitochondrial genomes might modify the structure. Thus,
the similarity factor is largely determined by the state
of the mitochondrial genomes. Nevertheless, the specific
objective here was to examine the effect of missing values
on the hierarchical clustering algorithms, such as
centroid linkage, single linkage, complete linkage and
average linkage, and to discover whether newer computation
methods, such as SOM, can offer better clustering results
than the traditional k-means method. The outcomes suggest
that hierarchical clustering gives fast, robust and
accurate results, particularly when the missing value rate
is lower than 4%. None of the computation methods could
reasonably correct for the influence of missing values
above this 4% threshold. In these circumstances, one
should consider eliminating the genes with many missing
values or repeating the tests if possible.

International Journal of Advanced Engineering Research and Science (IJAERS) [Vol-5, Issue-12, Dec-2018]
https://dx.doi.org/10.22161/ijaers.5.12.41 ISSN: 2349-6495(P) / 2456-1908(O)

As noted before, clustering datasets are typically
normalized, so a data value near zero indicates the
absence of any relation between a pair of genes. Thus a
simple solution to the problem of missing values is to
substitute those entries with zeros. Although this might
appear to be a simplistic methodology, it has some
justification: most genes probably do not work together,
and hence their relation score is likely to be close to
zero. Likewise, it is observed that the mean/median of the
non-missing entries in the datasets defined before is
almost zero. This method serves as a starting point for
experimental assessments.
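
The simple substitution strategies mentioned in this section (zeros, row means, row medians) can be sketched in a few lines of Python. The function name `impute`, the strategy labels and the use of `None` for a missing cell are choices made for this illustration only; they are not taken from Cluster or from the cited works.

```python
from statistics import mean, median

STRATEGIES = ("zero", "row_mean", "row_median")

def impute(rows, strategy="zero"):
    """Return a copy of rows in which every None entry is filled in.

    rows     -- list of gene rows, each a list of floats or None
    strategy -- "zero", "row_mean" or "row_median" (labels used only here)
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    filled = []
    for row in rows:
        present = [v for v in row if v is not None]
        if strategy == "zero" or not present:
            fill = 0.0  # also the fallback for a row with no observations
        elif strategy == "row_mean":
            fill = mean(present)
        else:
            fill = median(present)
        filled.append([fill if v is None else v for v in row])
    return filled

# Two illustrative rows; the first has one missing observation.
data = [[1.1, 1.3, 0.8, None, 0.4],
        [0.9, 0.8, 0.7, 0.5, 0.2]]
zero_filled = impute(data, "zero")
median_filled = impute(data, "row_median")
```

Zero substitution only makes sense for data that have already been normalized around zero, as the paragraph above argues; on raw values the row mean or median is usually the less distorting baseline.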

Loading, Filtering and Adjusting Data: A machine learning
system is established for determining gene functions from
assorted data sources using hierarchical clustering. By
convention, in the group of input data tables, rows
represent genes and columns represent samples or observed
values from yeast microarray hybridizations. When the
three steps to retrieve data (Loading, Filtering and
Adjusting Data) are performed in clustering, a small
Cluster input dataset resembles Table I [21].

Loading data: The YORF field contains an alphanumeric
value. It is used in Tree View to state how the rows are
connected. The remaining cells in the table contain data
for the corresponding gene and sample. For instance, the
value for YAL001W at 0 min is 1, and the value for the
same gene at 2 hours is 5.8. Missing data are allowed and
are denoted by blank cells. In order to identify missing
values, the operation “Present % >= X” is enabled.


Table I. YORF - Yeast open reading frame

S.No  YORF     0 min  30 min  1 hour  2 hour  4 hour
1     YAL001W  1      1.3     2.4     5.8     2.4
2     YAL002W  0.9    0.8     0.7     0.5     0.2
3     YAL003W  0.8    2.1     4.2     10.1    10.1
4     YAL005C  1.1    1.3     0.8             0.4
5     YAL010C  1.2    1       1.1     4.5     8.3
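
A minimal Python sketch of the loading step is given below, assuming the input is tab-delimited text with the YORF identifier in the first column and blank cells for missing observations. The helper names `load_table` and `percent_present` are hypothetical, and the embedded text simply mirrors the small table above.

```python
import csv
import io

# Tab-delimited text mirroring the small sample table; the empty field
# in the YAL005C row stands for a missing observation.
RAW = (
    "YORF\t0 min\t30 min\t1 hour\t2 hour\t4 hour\n"
    "YAL001W\t1\t1.3\t2.4\t5.8\t2.4\n"
    "YAL002W\t0.9\t0.8\t0.7\t0.5\t0.2\n"
    "YAL003W\t0.8\t2.1\t4.2\t10.1\t10.1\n"
    "YAL005C\t1.1\t1.3\t0.8\t\t0.4\n"
    "YAL010C\t1.2\t1\t1.1\t4.5\t8.3\n"
)

def load_table(text):
    """Parse the table into (sample names, {gene id: list of values})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    genes = {}
    for record in reader:
        name, cells = record[0], record[1:]
        # Blank cells become None so that later steps can skip them.
        genes[name] = [float(c) if c.strip() else None for c in cells]
    return header[1:], genes

def percent_present(values):
    """Percentage of columns in which a gene has an observed value."""
    return 100.0 * sum(v is not None for v in values) / len(values)

samples, genes = load_table(RAW)
```

With this representation, the "Present % >= X" check reduces to comparing `percent_present(genes[name])` against X for each gene.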


The large sample data file, like the small sample data
file given in Table I, contains the yeast gene expression
data described in Eisen et al. These data are used for
testing and training in addition to loading the cluster
bunch. Each cluster bunch provides summary information
about the loaded data file. Once the data are loaded, the
listed measures, namely Correlation (uncentered),
Correlation (centered), Absolute correlation (uncentered),
Absolute correlation (centered), Spearman Rank
correlation, Kendall’s tau, Euclidean distance and City
block distance, are used as the testing and training
statistics for the different cluster analysis methods.
Grouping is a significant tool for exploring such a
cluster bunch of microarray information, whose usual
properties are intrinsic ambiguity, noise and fuzziness
[22, 23, 24, 25, 26, 27, 28]. Some columns and rows in the
dataset are optional. Tree View uses the ID in the YORF
column as the label for each gene, and a label can be
specified for each gene that is distinct from the ID given
in the YORF column. The 31 rows and 79 columns are
labelled later in the dataset for loading purposes. Filter
Data removes the genes that lack certain desired
properties, which are set through the properties of the
dataset. Properties such as the enable and disable options
are used to load, apply and accept the filter, as shown in
Table II.

Filter data: Filtering is the process of eliminating genes
that lack certain desired properties, as described in
Table II. The properties currently available for filtering
data are listed there [28]. These are fairly
self-explanatory. When a filter is applied, it is not
immediately used on the dataset. First, the filter reports
how many genes would pass. If the filter is then accepted,
only the genes that pass are kept; otherwise no
modifications are made.


Table II. Eliminate genes lacking desired properties from
dataset of 31 rows and 79 columns

S.No  Limitation                                       Status
1     Present % >= 80 = A                              Enabled
2     SD (Gene vector) > 2.0 = A                       Disabled
3     At least 1 observation with abs(Val) >= 2.0 = B  Disabled
4     High Value - Low Value >= 2.0 = A                Disabled
5     Apply filter                                     21 passed out of 31
6     Accept filter                                    Enabled


• Step 1 eliminates all genes that have missing
numerical information in more than (100 - A) percent
of the columns.

• Step 2 eliminates all genes whose observed numerical
information has a standard deviation less than A.

• Step 3 eliminates all genes that do not have at
least A observations with absolute values larger
than B.

• Step 4 eliminates all genes whose high value minus
low value is less than A.
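
The four filter steps can be expressed as simple predicates over a gene's row of values. The sketch below is an assumption-laden illustration: the function names are hypothetical, `None` marks a missing value, the sample standard deviation is used for Step 2, and the boundary cases follow the ">=" form shown in Table II. It also separates "apply" (count how many genes would pass) from "accept" (actually keep them), as described in the text.

```python
from statistics import stdev

def observed(values):
    """Values that are actually present in a gene's row."""
    return [v for v in values if v is not None]

def passes_present(values, a=80.0):
    """Step 1: data present in at least a percent of the columns."""
    return 100.0 * len(observed(values)) / len(values) >= a

def passes_sd(values, a=2.0):
    """Step 2: standard deviation of the observed values is at least a."""
    obs = observed(values)
    return len(obs) > 1 and stdev(obs) >= a

def passes_abs(values, a=1, b=2.0):
    """Step 3: at least a observations with absolute value >= b."""
    return sum(abs(v) >= b for v in observed(values)) >= a

def passes_maxmin(values, a=2.0):
    """Step 4: high value minus low value is at least a."""
    obs = observed(values)
    return bool(obs) and max(obs) - min(obs) >= a

def apply_filter(genes, checks):
    """Report which genes would pass, without modifying the dataset."""
    return [name for name, vals in genes.items()
            if all(check(vals) for check in checks)]

def accept_filter(genes, checks):
    """Keep only the genes that pass every enabled check."""
    passed = set(apply_filter(genes, checks))
    return {name: vals for name, vals in genes.items() if name in passed}

genes = {
    "YAL001W": [1, 1.3, 2.4, 5.8, 2.4],
    "YAL002W": [0.9, 0.8, 0.7, 0.5, 0.2],
    "YAL005C": [1.1, 1.3, 0.8, None, 0.4],
}
would_pass = apply_filter(genes, [passes_present, passes_maxmin])
```

Enabling or disabling a limitation then corresponds to including or leaving out the matching predicate in the `checks` list.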

Genes pass through the filter when the filter is applied
and accepted. Before the data are filtered, the filter
results hold their default values. Hence they are kept NIL,
as given in Equation 1 and Equation 2.


Apply filter = NIL     (Eq. 1)

Accept filter = NIL    (Eq. 2)


The default values for passing the genes through the
filter are presented in Table III.


Table III. Assign default values to filter genes lacking
desired properties from dataset of 31 rows and 79 columns

S.No  Option    Entry                         Value
1     Disabled  % present >=                  80
2     Disabled  SD (Gene vector)              2.0
3     Disabled  At least                      1
4     N/A       Observation with abs(val) >=  2.0
5     Disabled  High Value - Low Value >=     2.0


There are six conditions for passing the genes through the
filter. They are as follows:


Condition 1: When the filter operation is applied to the
given dataset with the default values assigned as in Table
III, all 31 of the 31 rows pass, with no missing
information. No missing values are found. This is verified
by checking the result in the gene cluster tool. The
result is presented in Table IV.

Table IV. Identifying genes lacking desired properties
from dataset of 31 rows and 79 columns > 100-80

S.No  Option   Entry         Value
1     Enabled  % present >=  80


Condition 2: Next, if the genes have % present >= 80, the
result shows that there is no missing information, and no
further filtering is necessary when passing the genes.

Condition 3: If the standard deviation filter, SD (gene
vector), is enabled, none of the 31 rows pass. All genes
whose observed numerical information has a standard
deviation less than 2.0 are removed.

Condition 4: Genes are passed if they have at least 1
observation whose absolute value satisfies Equation 3,
where abs(Val) is larger than 20. This allows 3 rows to
pass out of 31 rows.

abs(Val) > 20    (Eq. 3)

Condition 5: For each gene, the low value is subtracted
from the high value, as given in Equation 4.

High Value - Low Value >= 2.0    (Eq. 4)

This condition also passes 3 rows out of 31 rows, as in
condition 4.

Condition 6: If the filtered genes have a high value as
given in Equation 5, the filter passes 21 rows out of 31
rows.

High Value >= 20    (Eq. 5)

Finally, the filter is accepted for conditions 3, 4, 5 and
6 so that the filtered rows can be processed further.

Adjust Data-Units mean: Five tasks are used to adjust the
information, and they are performed by modifying the
original information. The information is adjusted in terms
of log transform data, center genes (mean), center arrays
(mean), normalize genes and normalize arrays. The center
genes and center arrays operations can alternatively use
the median as the value for adjusting the information.
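
The five adjustment tasks can be sketched as small row operations that are applied either to genes (rows) or to arrays (columns). The sketch below assumes a base-2 logarithm for the log transform and a scaling to unit sum of squares for normalization; both are assumptions about what the adjustment does and are not stated in this paper. The function names are illustrative, and missing values are assumed to have been filled or filtered beforehand.

```python
import math
from statistics import mean, median

def log_transform(row):
    """Replace each (positive) value by its base-2 logarithm."""
    return [math.log2(v) for v in row]

def center(row, use_median=False):
    """Subtract the row's mean (or median) from every value."""
    mid = median(row) if use_median else mean(row)
    return [v - mid for v in row]

def normalize(row):
    """Scale the row so that the sum of squares of its values is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def adjust(matrix, op, by="genes"):
    """Apply op to every gene (row) or to every array (column)."""
    if by == "arrays":
        return transpose([op(col) for col in transpose(matrix)])
    return [op(row) for row in matrix]

matrix = [[1.0, 2.0, 4.0],
          [8.0, 4.0, 2.0]]
logged = adjust(matrix, log_transform)                      # log transform data
gene_centered = adjust(logged, center)                      # center genes, mean
array_centered = adjust(logged, lambda r: center(r, True), by="arrays")  # center arrays, median
```

The order matters: centering after the log transform is not the same as log transforming centered data, and, as discussed later for the mean and median filters, filtering after adjustment can pass a different number of rows than filtering before it.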

III. PROPOSED STUDY ON CLUSTERING FOR
SMALL SAMPLE SET - HIERARCHICAL
(GENE) CLUSTERING

The procedures for building hierarchical clusters operate
on mutually exclusive subgroups (genes and arrays).
Subgroups whose members are highly similar to one another
are identified using a nearest neighbour search algorithm.
Weights are determined in addition to the grouping
[29, 30]. The cutoff value (0.1) and the exponent value
(1) are set as defaults, and the similarity metric
correlation (uncentered) is chosen for determining the
weights. The correlation (uncentered) metric is used here
with centroid linkage, where a vector is assigned to each
cluster to compute the distance. The distances computed
with the centroid linkage method are used to cluster the
data and generate the cluster bunch. First, the gene tree
file (.gtr) is generated, listing each node with its gene
values. Second, an array tree file (.atr) is generated,
listing each node with its array values, with the same
exponent of 1. Third, a clustered data table file (.cdt)
is generated, containing the EWEIGHT (experiment weight)
and GWEIGHT (gene weight) values. The same file generation
process used for the centroid linkage method in
hierarchical clustering is followed for the single linkage
method, complete linkage method and average linkage
method. For instance, the centroid linkage method yields
two nodes and two gene values in the generated gene tree,
as shown in Table V (sample 1).


Table V. Node gene, sample 1

Node     Gene Matrix       Range
Node 1X  Gene 0X  Gene 1X  -0.527353
Node 2X  Gene 1X  Gene 2X  -0.94495


The result for the single linkage method is given in
Table VI (i.e., sample 2).


Table VI. Node gene, sample 2

Node     Gene Matrix       Range
Node 1X  Gene 0X  Gene 1X  -0.527353
Node 2X  Gene 1X  Gene 2X  -0.611316


The rest of the files are generated similarly for the
centroid linkage and single linkage methods. The complete
linkage method differs in value from the others. It
generates the values given in Table VII (i.e., sample 3).


Table VII. Node gene, sample 3

Node     Gene Matrix       Range
Node 1X  Gene 1X  Gene 0X  -0.527353
Node 2X  Gene 2X  Gene 1X  -0.819574


For the average linkage method, the gene tree file is
generated as given in Table VIII (i.e., sample 4). As with
the other methods, it shows a different value for the
second node.


Table VIII. Node gene, sample 4

Node     Gene Matrix       Range
Node 1X  Gene 0X  Gene 1X  -0.527353
Node 2X  Gene 1X  Gene 2X  -0.715445
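
The node tables above share the same first merge and differ only at the second node, which is what one would expect when the linkage rule changes how the distance to an already merged cluster is computed. The following Python sketch illustrates this with a generic agglomerative procedure over a precomputed distance matrix. It is not Cluster's implementation: the function name is hypothetical, it works on distances where a gene tree file may record similarities, and centroid linkage is left out because it needs the original data vectors and cannot be derived from pairwise distances alone.

```python
def agglomerate(dist, linkage="average"):
    """Agglomerative clustering over a symmetric distance matrix.

    Returns the merge history as (members_a, members_b, distance)
    tuples, in the order in which the merges happen.
    """
    combine = {
        "single": min,                            # closest pair of members
        "complete": max,                          # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
    }[linkage]
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_dists = [dist[a][b]
                              for a in clusters[i] for b in clusters[j]]
                d = combine(pair_dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return merges

# Three illustrative items located at 0, 1 and 5 on a line.
D = [[0, 1, 5],
     [1, 0, 4],
     [5, 4, 0]]
history = {name: agglomerate(D, name)
           for name in ("single", "complete", "average")}
```

In this toy case all three rules merge items 0 and 1 first at distance 1, and then join item 2 at distance 4 (single), 5 (complete) or 4.5 (average), mirroring the pattern of an identical first node and a linkage-dependent second node.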


After hierarchical clustering has been performed, k-means
clustering is chosen for evaluation. The same Eisen
dataset that was fed to hierarchical clustering is used in
k-means clustering.

K-Means (gene) clustering technique: The genes and arrays
of the dataset are analysed using the k-means clustering
algorithm. Both genes and arrays use 10 clusters (k) and
100 runs each, for which the k-means and k-medians are
determined. On executing k-means with the Euclidean
distance similarity metric for both genes and arrays, it
is found that there are more clusters than genes. The
entire dataset is then passed without any gene filter,
irrespective of the number of observations or the absolute
value specification. Also, the data are adjusted
independently of the hierarchical technique. After
execution, k-means generates a gene cluster file (.kgg),
in which the genes are grouped into 10 clusters, and an
array cluster file (.kag); the open reading frame (ORF)
data are written to these .kgg and .kag files. The genes
are grouped into 10 groups, and the clusters for k = 10
genes and 10 arrays are listed with gene weight and
experiment weight.
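
The idea of running k-means many times and keeping the best solution can be sketched as follows. This is a plain Lloyd-style k-means with squared Euclidean distance, written only for illustration: the function names, the seeding scheme and the iteration limit are assumptions, and nothing here reproduces the file outputs of the tool used in this study.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random initial centers; returns (cost, labels)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return cost, labels

def kmeans(points, k, runs=100, seed=0):
    """Repeat k-means `runs` times and keep the lowest-cost solution."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)),
               key=lambda result: result[0])

# Two well separated groups of one-dimensional points.
points = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
cost, labels = kmeans(points, k=2, runs=10)
```

Repeating the run matters because a single k-means run can settle in a poor local optimum; the text's setting of k = 10 with 100 runs corresponds to `kmeans(points, 10, runs=100)` on the gene or array vectors.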

Self-Organized Mapping and Principal Component Analysis:
After the k-means clustering technique has been executed,
the same Eisen dataset is tested with Self-Organized
Mapping (SOM) and Principal Component Analysis (PCA). SOM
organizes the genes and arrays in a manner similar to
k-means clustering. The X dimension and Y dimension are
each set to 3 for the genes and arrays. The default number
of iterations is 100,000 for genes and 20,000 for arrays.
The initial tau is set to 0.02 by default, and the SOM
outcomes for genes and arrays are similar. The similarity
metric here is the Euclidean distance. Of the three files
generated, the GNF file shows the gene vectors and the ANF
file shows the array vectors. The gene/array file shows
the gene weight and experiment weight of the vectors. The
mean values are not presented in the self-organized maps
[31]. Therefore principal component analysis (PCA) is
applied to genes and arrays to calculate the mean. PCA
produces the principal components of the arrays and genes.
The gene and array results are given as two sets of
coordinates: the array coordinates show the eigenvalues
with experiment weight, and the gene coordinates show the
gene weight. All the clustering techniques (hierarchical,
k-means, self-organized mapping and PCA) adjust the data
to the mean. The result of filtering when the data are
adjusted to the median is shown below. Hence the data must
be filtered before the adjusting process.

Filter data: Filtering the data with the mean is similar
to filtering the data with the median.

Adjusting data with median for at least 1 observation
with abs(val) >= 2.0

The difference found between filtering with the mean and
with the median is as follows. Adjusting to the mean first
and then filtering passes no rows out of 31 rows.
Adjusting to the median first and then filtering also
passes no rows out of 31 rows. Filtering genes for at
least 1 observation with abs(val) >= 2.0 passes 3 rows out
of 31 rows. The filter is accepted, and clustering is
performed after the rows have passed. Adjusting the data
with center genes and center arrays set to mean and median
respectively, and vice versa, and then filtering passes no
rows out of 31 rows. Adjusting data with the median is
similar to adjusting data with the mean for log transform
data and for normalizing genes or arrays, for center genes
and center arrays respectively.

IV. PROPOSED STUDY ON CLUSTERING
FOR HIERARCHICAL (GENE) CLUSTERING
TECHNIQUE - LARGE SAMPLE SET

The various similarity metric performances are
measured. They are: Correlation (uncentered), Correlation
(centered), Absolute correlation (uncentered), Absolute
correlation (centered), Spearman Rank correlation,
Kendall’s tau, Euclidean distance and City block distance.

Table IX. Comparison between clustering methods

                    31 rows               31 rows               2467 rows             2467 rows
                    node/gene             node/array            node/gene             node/array
Clustering method   SR         ER         SR         ER         SR         ER         SR         ER

Gene/array similarity metric: Correlation uncentered
Centroid linkage    0.642641   0.16757    0.90082    0.668934   0.988387   0.354391   0.929455   0.075474
Single linkage      0.642641   0.336574   0.90082    0.722635   0.988387   0.414903   0.929455   0.288938
Complete linkage    0.642641   -0.34663   0.90082    -0.805213  0.988387   -0.883172  0.929455   -0.489157
Average linkage     0.642641   0.100977   0.90082    -0.110935  0.988387   -0.28906   0.929455   0.0223

Gene/array similarity metric: Correlation centered
Centroid linkage    0.640981   0.123294   0.896823   -0.541497  0.989404   -0.606245  0.926293   -0.141204
Single linkage      0.640981   0.287747   0.896823   0.418646   0.989404   0.961167   0.926293   0.287638
Complete linkage    0.640981   -0.335755  0.896823   -0.750119  0.989404   -0.89763   0.926293   -0.520852
Average linkage     0.640981   0.090961   0.896823   -0.082129  0.989404   -0.068484  0.926293   -0.018541

Gene/array similarity metric: Absolute correlation uncentered
Centroid linkage    0.642641   0.167570   0.900820   0.063715   0.988387   0.094143   0.929455   0.054159
Single linkage      0.642641   0.336574   0.900820   0.444248   0.988387   0.414903   0.929455   0.332774
Complete linkage    0.642641   0.000931   0.900820   0.000000   0.988387   0.000000   0.929455   0.000056
Average linkage     0.642641   0.130989   0.900820   0.158227   0.988387   0.114757   0.929455   0.092952

Gene/array similarity metric: Absolute correlation centered
Centroid linkage    0.640981   0.123294   0.896823   0.018289   0.989404   0.013264   0.926293   0.071195
Single linkage      0.640981   0.293646   0.896823   0.418646   0.989404   0.404155   0.926293   0.335699
Complete linkage    0.640981   0.001184   0.896823   0.000083   0.989404   0.000000   0.926293   0.000074
Average linkage     0.640981   0.117903   0.896823   0.152002   0.989404   0.126558   0.926293   0.087962

Gene/array similarity metric: Spearman rank correlation
Centroid linkage    0.693216   -0.049660  0.910012   -0.274194  0.973099   -0.001144  0.906171   -0.126924
Single linkage      0.693216   0.283253   0.910012   0.337878   0.973099   0.412512   0.906171   0.265874
Complete linkage    0.693216   -0.414645  0.910012   -0.691423  0.973099   -0.818796  0.906171   -0.477460
Average linkage     0.693216   0.064168   0.910012   -0.051662  0.973099   -0.024957  0.906171   -0.022292

Gene/array similarity metric: Kendall's tau
Centroid linkage    0.508900   -0.056484  0.746514   -0.135484  0.885758   0.011360   0.749915   -0.085986
Single linkage      0.508900   0.195261   0.746514   0.246734   0.885758   0.296595   0.749915   0.183322
Complete linkage    0.508900   -0.267782  0.746514   -0.510871  0.885758   -0.636636  0.749915   -0.340330
Average linkage     0.508900   0.044218   0.746514   -0.037265  0.885758   -0.002423  0.749915   -0.015095

Gene/array similarity metric: Euclidean distance
Centroid linkage    0.928197   0.000000   0.954213   0.000000   0.995196   0.000000   0.928997   0.000000
Single linkage      0.914380   0.000000   0.924451   0.000000   0.991290   0.000000   0.894987   0.000000
Complete linkage    0.950895   0.000000   0.981846   0.000000   0.998144   0.000000   0.978606   0.000000
Average linkage     0.936194   0.000000   0.964699   0.000000   0.995290   0.000000   0.956165   0.000000

Gene/array similarity metric: City block distance
Centroid linkage    0.732687   0.000000   0.775965   0.000000   0.928988   0.000000   0.737505   0.000000
Single linkage      0.712875   0.000000   0.690699   0.000000   0.909338   0.000000   0.675318   0.000000
Complete linkage    0.774877   0.000000   0.867530   0.000000   0.960542   0.000000   0.854260   0.000000
Average linkage     0.748284   0.000000   0.795720   0.000000   0.932525   0.000000   0.791840   0.000000



Table IX compares the performance of the similarity
measures across the different clustering methods. It also
helps in identifying the missing values of yeast, which
leads to determining the time complexity.

V. RESULTS AND DISCUSSION 

Clustering genes and arrays with the hierarchical
technique sorts the results by the similarity metric
correlation (uncentered) for the centroid linkage
clustering method. For instance, this results in a range
from 0.642641 to 0.167570 (node/gene).


Table X. The codes for the methods

Method                             Code
Hierarchical                       H
Gene                               G
Clustering                         C
Gene Array                         GA
Correlation (uncentered)           CU
Correlation (centered)             CC
Absolute correlation (uncentered)  ACU
Absolute correlation (centered)    ACC
Spearman Rank correlation          SRC
Kendall’s tau                      KT
Euclidean distance                 ED
City block distance                CBD
Centroid Linkage                   CEL
Single linkage                     SL
Complete Linkage                   COL
Average Linkage                    AL
Cluster                            C
Cluster Weights                    CW


For single linkage, the corresponding node/gene,
node/array and weights are presented in the tabulation
for the method (H_G_C_CU_SL).


Fig. 1. Gene tree view, 31 rows and 79 columns


For complete linkage and average linkage,
H_G_C_CU_COL and H_G_C_CU_AL, the same evaluation is done
as for centroid and single linkage. All these methods are
tested with all the other similarity metrics, and the
performance is reported in Table IX. For correlation
(centered), the corresponding procedure codes
H_GA_C_CC_CEL, H_GA_C_CC_SL, H_GA_C_CC_COL and
H_GA_C_CC_AL are used. The node/gene range is the same for
H_GA_C_CU_CEL and H_GA_C_ACU_CEL. The initial value of the
node/array range is the same for H_GA_C_CU and H_GA_C_ACU
in all four methods (centroid, single, complete and
average).

Table XI. 31 rows and 2467 rows (79 columns) - cluster
range

Cluster         31 rows    31 rows     2467 rows  2467 rows
method   Range  node/gene  node/array  node/gene  node/array
         SR     0.642641   0.9008      0.988387   0.929455
CEL      ER     0.16757    0.6689      0.354391   0.075474
SL       ER     0.336574   0.7226      0.414903   0.288938
CL       ER     0.34663    -0.805      0.88317    0.48916
AL       ER     0.100977   -0.111      0.28906    0.0223



The small-scale data involve observations for only 31 rows
and 79 columns. When the size is increased to 2467 rows
and 79 columns, as given in Table XI, clustering
performance is maintained effectively, and the Euclidean
and city block distance measures show better outcomes on
the large dataset than the other similarity measures
[32-34]. The time taken to cluster data with the
similarity measures ACC, SRC and KT is determined. The
ACU, ACC, ED and CBD computation time is also calculated
for the gene/array cluster bunch, with weights of
cutoff = 0.1 and exponent = 1 for genes and arrays. Only a
few similarities and variations are noted for CU when the
two values C and CW are compared; the starting value of
the cluster range is close to that of the cluster weights
for CEL.

Table XII. 31 rows and 2467 rows (79 columns) -
execution time

                                   Time (sec)
                                   31 rows    31 rows     2467 rows  2467 rows
Cluster Metrics                    node/gene  node/array  node/gene  node/array
Correlation (uncentered)           38         35          31         34
Correlation (centered)             32         34          30         33
Absolute correlation (uncentered)  30         28          29         27
Absolute correlation (centered)    26         28          28         26
Spearman Rank correlation          22         25          28         24
Kendall’s tau                      21         23          22         22
Euclidean distance                 10         2           4          6
City block distance                7          4           5          2



A comparison of execution times shows that clustering with
the ED and CBD measures takes much less time to process
the data, as given in Table XII.

To compare the techniques used in this work, a statistical
test has been conducted. A Z-test for equality of variance
between the similarity measures was used to test the
hypothesis of equality of two population variances. It
shows 6.25 for Correlation (uncentered) and 8.25 for
Correlation (centered), and no variances for Absolute
correlation (uncentered), Absolute correlation (centered),
Spearman Rank correlation, Kendall’s tau, Euclidean
distance and City block distance when the size of each
sample is 30 or larger.

VI. CONCLUSION

As with CU, the SR is the same across the cluster methods
for the CC, ACU, ACC, SRC and KT similarity measures,
while the ER differs. For ED and CBD, the SR differs
across the cluster methods and the ER is the same. KT
alone takes more time to generate the output. The gene
tree view for 31 rows and 79 columns, with x and y pixels,
mask < 0 and corr select cutoff = 0.8, is shown in
Figure 1. The colour indications are green for negative,
black for zero, red for positive and gray for missing. The
gene tree view for 2467 rows and 79 columns has fewer
missing values. Hence the data mining methods are studied
and compared to measure clustering performance across the
various methods.

In future work, the same small and large sample yeast gene
data can be tested with self-organized mapping and
principal component analysis, using a process similar to
the one used for hierarchical and k-means clustering. The
performance time can also be reduced.